Abstract — Cascade classifiers are one of the most important 
contributions to real-time object detection. Nonetheless, there are 
many challenging problems arising in training cascade detectors. 
One common issue is that each node classifier is trained as a 
symmetric classifier. A low misclassification error rate does 
not guarantee that the node learning goal of cascade classifiers 
is met, i.e., an extremely high detection rate with a moderate 
false positive rate. In this work, we present a new approach 
to train an effective node classifier in a cascade detector. The 
algorithm is based on two key observations: 1) Redundant weak 
classifiers can be safely discarded; 2) The final detector should 
satisfy the asymmetric learning objective of the cascade architec- 
ture. To achieve this, we separate the classifier training into two 
steps: finding a pool of discriminative weak classifiers/features 
and training the final classifier by pruning weak classifiers which 
contribute little to the asymmetric learning criterion (asymmetric 
classifier construction). Our model reduction approach helps 
accelerate the learning time while achieving the pre-determined 
learning objective. Experimental results on both face and car 
data sets verify the effectiveness of the proposed algorithm. On 
the FDDB face data set, our approach achieves state-of-the-art 
performance, which demonstrates its advantage. 

Index Terms — Object detection, boosting, asymmetric pruning, 
asymmetric classification, feature selection, cascade classifier. 

I. Introduction 

Real-time object detection is a fundamental topic in com- 
puter vision due to its tremendous uses in many applications 
such as video surveillance, real-time human computer interaction, 
robotics, etc. [1]-[4]. The task of object detection is to 
identify predefined objects in a given image using knowledge 
learned from pre-labeled objects. Among various real-time 
object detection algorithms, Viola and Jones' algorithm [1] is 
the most commonly adopted approach due to its effectiveness 
and efficiency. Their framework consists of two phases. The 
first phase discovers and learns discriminative features from 
a large set of feature pools (feature extraction). Extracted 
features are used to construct a classifier in the second phase 
(classifier learning). In [1], the authors combined these two 
phases together through the use of AdaBoost. AdaBoost selects 
relevant features and at the same time constructs a strong 
classifier. 

Significant effort has been spent on improving the Viola and 
Jones framework. One common technique is to post-adjust the 
linear coefficients of weak classifiers selected by AdaBoost in 
order to introduce an asymmetric property into the cascade 
classifier for an effective rejection of negative patches in early 
nodes. Post-processing algorithms can be divided into four 
categories: (a) By tuning node thresholds during detector 
training, e.g., the traditional cascade classifier [1]; (b) By tuning node 
thresholds after the entire cascade classifier has been learned, 
e.g., soft cascade [5], optimized cascade [6]; (c) By tuning 
weak classifiers and weak classifiers' coefficients during the 
cascade detector training, e.g., the LAC classifier [7]; (d) By 
tuning weak classifiers and weak classifiers' coefficients after 
the entire cascade classifier has been learned, e.g., the joint 
cascade [8]. 

The authors are with The Australian Center for Visual Technologies, The 
University of Adelaide, SA 5005, Australia (e-mail: {paul.pais, chunhua.shen, 
anton.vandenhengel}@adelaide.edu.au). Correspondence should be addressed 
to C. Shen. This work was in part supported by Australian Research Council 
Future Fellowship FT120100969. 

A cascade classifier consists of a set of node classifiers (see 
Fig. 1 for an illustration). It is very different from a standard 
classifier in that the overall detection rate (classification 
accuracy on the positive data) can be approximately calculated as 
the product of the detection rate of each node classifier: 

$$\mathrm{DR}_{\mathrm{ovr}} = \prod_{k} \mathrm{DR}_{k}. \tag{1}$$

The overall false positive rate (classification error rate on the 
negative data) is the product of the false positive rate of each 
node: 

$$\mathrm{FP}_{\mathrm{ovr}} = \prod_{k} \mathrm{FP}_{k}. \tag{2}$$

Here $k$ indexes the node classifier. These two equations are 
valid under the assumption that each node makes independent 
classification errors. From these two equations, it is easy to 
see that in order to achieve a high overall detection rate and 
a low overall false positive rate, each node classifier must 
achieve an extremely high detection rate and only a moderate 
false positive rate. For instance, if the design goal for each 
node is to have a detection rate of 99.5% and a false positive 
rate of around 50% and the cascade classifier has 22 nodes 
in total, then the overall performance is 
$\mathrm{DR}_{\mathrm{ovr}} \approx 90\%$ and 
$\mathrm{FP}_{\mathrm{ovr}} \approx 2 \cdot 10^{-7}$. This is referred to as the 
node learning goal in [7]. 
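
As a quick numerical check of (1) and (2), the following Python sketch multiplies per-node rates under the same independence assumption. The function name `cascade_rates` is ours and purely illustrative; it is not part of any detector implementation.

```python
import math

def cascade_rates(node_detection_rates, node_false_positive_rates):
    """Overall rates of a cascade whose nodes err independently, as in (1) and (2)."""
    overall_dr = math.prod(node_detection_rates)
    overall_fp = math.prod(node_false_positive_rates)
    return overall_dr, overall_fp

# The design goal quoted in the text: 22 nodes, each with a 99.5% detection
# rate and a 50% false positive rate.
dr, fp = cascade_rates([0.995] * 22, [0.5] * 22)
print(f"DR_ovr = {dr:.3f}, FP_ovr = {fp:.2e}")  # DR_ovr = 0.896, FP_ovr = 2.38e-07
```

Even a small drop in the per-node detection rate compounds quickly over 22 nodes, which is why each node must favor detections over rejections.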

In this work, we introduce a new post-processing approach 
by pruning AdaBoost's weak classifiers during the course of 
detector training. The intuition behind our pruning approach is 
to remove less discriminative weak classifiers while focusing 
on the asymmetric learning objective of cascade classifiers. 
In short, we use a fast asymmetric pruning technique to train 
the node classifier that better meets the node learning goal and 
consequently improves the overall performance of the cascade. 

Pruning is a well-known technique widely used in super- 
vised learning such as feature selection. It reduces the size of 
the model by removing model components that provide little 
or no discriminative power to classify instances. It has been 
applied in a decision tree to remove nodes which are less 
significant [9]. By using pruning, one is able to reduce the 
complexity of the final classifier as well as achieve a better 
predictive accuracy. 

Boosting has been a method of choice for many learning 
problems including visual object detectors. The algorithm 
constructs a strong classifier which consists of a linear combi- 
nation of weak classifiers. The training procedure of boosting 
is an iterative process. At each iteration, the algorithm selects 
the weak classifier which has minimal weighted error. Samples 
are reweighted based on their classification error in the previous 
round. The process continues until the maximum number of 
iterations is reached or no weak classifier can be added into the 
ensemble. One of the popular boosting algorithms is AdaBoost 
[10]. Although AdaBoost has been commonly used in object 
detection, various researchers argue that AdaBoost is sub-optimal 
for training cascade classifiers [7], [11], [12]. A few 
alternative algorithms have been proposed to replace AdaBoost, 
e.g., asymmetric boosting [13]-[15] and GentleBoost [11]. 

Pruning ensemble classifiers obtained from AdaBoost is of 
interest for many reasons. Firstly, AdaBoost is popular due 
to its simplicity and efficiency. Weak classifiers' coefficients 
can be calculated in a closed form. Although training a face 
detector was reported to be slow in the original Viola and 
Jones' framework (training a complete cascade classifier takes 
4 weeks), a recent study reveals that it is possible to speed 
up this training time by caching feature values at the start 
of the AdaBoost training p\. Secondly, AdaBoost is a well 
studied method and has been shown to be effective for many 
classification problems. Thirdly, the final trained classifier 
is an ensemble classifier which is fast to compute during 
evaluation. Unfortunately, AdaBoost performs sub-optimally 
in terms of achieving the asymmetric node learning goal 
[7]. Furthermore, AdaBoost is an ensemble learning technique 
which uses a forward selection search strategy. The algorithm 
is short-sighted and might not produce a near optimal classifier 
p2| . Finally, AdaBoost reduces the training error rate by 
concentrating on examples that are difficult to classify. As a 
result, AdaBoost may select irrelevant weak classifiers if the 
training data are noisy or contain outliers. 

In this work, we propose to prune weak classifiers trained 
by AdaBoost by eliminating less relevant features from the 
candidate feature pool. To be more specific, we exclude weak 
classifiers which have a minimal impact on a class separation 
between positive and negative samples. The criterion is not 
only capable of eliminating redundant features but also able 
to exploit the asymmetric node learning goal. The resulting 
ensemble is a compact linear combination of weak classifiers 
which is fast to evaluate. To perform pruning, we evalu- 
ate AdaBoost's weak classifiers using greedy sparse linear 
discriminant analysis (GSLDA) [16], [17]. By combining 
AdaBoost and GSLDA, we are able to exploit the fast feature 
selection (via AdaBoost) and achieve the asymmetric node 
learning goal in the cascade architecture (via GSLDA). This 
is a novel application of GSLDA in real-time object detection. 

In summary, we use AdaBoost to select an over-complete 
weak classifier pool. At the second step, an asymmetric pruning 
method (here we use GSLDA) is then applied to remove 
less relevant weak classifiers against the asymmetric node 
learning criterion. In theory, other feature selection methods 
such as the fast forward selection of Wu et al. [18] can be used 
to replace AdaBoost at the first step. 

Fig. 1. The cascade classifier. The overall detection rate and false positive 
rate can be calculated using (1) and (2). An input image patch is classified 
as a true detection only when it passes all the node classifiers. 

The main contributions of the presented work can be 
summarized as follows. 

1) We propose an alternative method to train an ensemble 
classifier which leads to a further performance improve- 
ment. The approach can be applied to many cascade 
classification based applications. The core of the pro- 
posed method is a novel application of the fast GSLDA 
algorithm. 

2) We apply pruning to two well-known frontal face detection 
data sets and to car data sets. Better performance is 
observed over a few other cascade classifiers. On a more 
challenging face data set, FDDB, our 
algorithm achieves state-of-the-art performance. 

The rest of the paper is organized as follows. Section II 
briefly outlines related work on pruning and post-training in 
object detection. Section III introduces background concepts of 
boosting and GSLDA. We then propose our pruning approach 
to enhance object detection performance. Experimental results 
are presented in Section IV. Finally, we conclude our paper 
in Section V. 

II. Related Work 

Over the past decade, the computer vision community has 
witnessed numerous successes in real-time object detection. 
Most of this work extends the original real-time face detector 
of Viola and Jones. Viola and Jones' work consists 
of three major components: 1) The cascade classifier. The 
cascade classifier can be represented as a degenerate tree. It 
is designed to efficiently filter out negative patches in early 
nodes for real-time face detection. 2) AdaBoost. AdaBoost 
trains a strong classifier by selecting discriminative features 
from a pool of Haar-like features. 3) Integral images for fast 
computation of Haar-like features. 

In the literature, there are a few approaches that attempt 
to improve the work of Viola and Jones. In this section, we 
focus on work that applies post-processing to the 
trained cascade classifier. By re-adjusting weak classifiers' 
coefficients and node thresholds, one can further improve the 
final performance of object detectors. In the rest of this section, 
we review existing work related to pruning and post-training 
algorithms. 
Zhang and Viola view the object detection problem as 
multiple instance learning [19]. They proposed to combine 
multiple instance pruning with soft cascade [5]. They first train 
a single boosted classifier on the entire data set. Instead of 
setting a node threshold using a simple heuristic rule, they set 
the node threshold so that at least one acceptable window will 
be retained. It is demonstrated that this post-training simplifies 
the training procedure and yields an improvement compared 
to the traditional cascade classifier and soft cascade. Chen and 
Chen proposed a novel cascaded structure called Meta-stages 
[20]. The algorithm appends additional classifiers, termed 
meta-stages, to the original boosted cascade. Meta-stages 
exploit information from previous nodes of the cascade classifier. 
Li and Zhang argued that features selected by AdaBoost 
could be suboptimal since AdaBoost trains weak classifiers 
in a sequential forward selection manner [21]. The authors 
introduced a boosting variant known as FloatBoost. FloatBoost 
incorporates the idea of floating search into AdaBoost. The 
algorithm backtracks and examines the already added weak 
classifiers and discards the redundant ones. They show that 
FloatBoost needs fewer weak classifiers than AdaBoost to 
achieve a similar performance. 

Wu et al. argued that tuning the node threshold to achieve 
the asymmetric node learning goal is suboptimal for training 
the cascade classifier [7], [22]. They proposed to decouple the 
problem of feature selection and ensemble classifier design 
in order to introduce asymmetry. They proposed to use the linear 
asymmetric classifier (LAC) for post-processing weak learners' 
coefficients. LAC is optimal for the node learning goal 
under the assumption that the linear projection of negative 
samples' features is symmetric and the linear projection of 
positive samples' features follows a Gaussian distribution. 
LAC maximizes the detection rate while discarding 50% of 
negative samples. This objective criterion has proven to be 
effective for training the cascade classifier as later shown in 
[12]. Wu et al. observed that in some cases, linear discriminant 
analysis (LDA) gives a better performance than LAC on face 
data sets. Shen et al. proposed greedy sparse LDA (GSLDA) 
as an alternative approach to train object detectors [12]. 
They generate a set of discriminative features by training one 
weak classifier for each Haar-like feature. GSLDA is then applied 
to sequentially select the best weak classifiers. The best weak 
classifier is the one that yields a maximal class separation 
when added to the current set. The major drawback of their 
technique is that the algorithm trains only one weak classifier 
for each Haar-like feature, similar to [18]. Hence, 
there is room to improve their detection performance since 
the number of available features is limited. It is the work of 
[7], [12], [22] that has directly inspired our work here. 

Our pruning approach is different from existing approaches. 
Unlike FloatBoost [21], where less discriminative weak classifiers 
are sequentially removed at each boosting iteration 
(floating search), we perform backward elimination with the 
asymmetric node learning goal after we have completed 
training a boosted classifier. Doing so not only results in 
a significantly reduced training time but also yields a final 
classifier which satisfies the asymmetric node learning goal. In 
contrast, FloatBoost does not take the asymmetric learning goal into 
account. The main purpose of FloatBoost is only to remove 
redundant weak classifiers. 

TABLE I 
Notation 

Notation   Description 
$N$        Number of training samples in each classifier 
$M$        Number of low level features 
$D$        Number of pixels for each training sample 
$L$        Number of bins for histogram features 
$T$        Number of features in the final classifier 
$T_1$      Number of initial features to be selected 
$T_2$      Number of features to be discarded during pruning ($T_1 - T$) 

Our approach is also different from [7], [22], where the 
authors re-train the coefficients of weak classifiers learned 
from AdaBoost. Since AdaBoost is greedy, i.e., choosing the 
weak classifier and the weak classifier's coefficient in order 
to cause the greatest reduction in the exponential loss, it can 
be short-sighted in choosing the best weak classifier at each 
iteration. Hence the final set of weak classifiers, as used in 
[7], [22], might not be optimal in order to achieve the node 
learning goal. This can be observed in Fig. 3, where the set of 
weak classifiers used in [7], [22] is not optimal with respect 
to fulfilling the asymmetric node learning goal. 

Compared to [12], our initial pool of weak classifier 
features is not only smaller but also more discriminative 
than theirs. Hence, the training time is faster and the trained 
detector is more accurate. In the next section, we present the 
fast and effective pruning algorithm for training the visual 
object detector. 

III. The Approach 

Our approach can be broken down into two steps. In 
the first step, we perform a sequential forward search to 
learn a sufficiently large set of discriminative features. In the 
second step, we perform a sequential backward elimination to 
construct a more compact binary ensemble classifier. In this 
section, we first discuss boosting based visual features. We 
then briefly review the concept of post-training binary stump 
features with an asymmetric classifier. Finally, we propose 
our pruning based feature selection algorithm. For ease of 
exposition, symbols and their denotations used in this section 
are summarized in Table I. 

A. Boosting based feature selection 

We are given a training set consisting of $N$ samples 
$\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^{M}$ is an $M$ dimensional 
feature vector of the $i$-th sample and $y_i \in \{-1, 1\}$ denotes the 
class label of the $i$-th sample. Here any feature descriptor that maps the original 
raw pixel features of $D$ dimensions to $M$ dimensions, e.g., 
Haar-like features or SIFT features, can be applied. Our goal 
is to learn a prediction function that achieves the asymmetric 
node learning goal in the cascade classifier. In order to achieve 
this, we first transform the original $M$ dimensional training 
data into another feature space, in which classes can be 
separated more easily. One possible transformation is to apply 
the $\mathrm{sign}(\cdot)$ function to each input feature. The transformation 
function can be written as 

$$h(x^{j}) = \mathrm{sign}\big(p\,(x^{j} - \theta)\big), \quad j \in [1, M], \tag{3}$$

where $x^{j}$ is the input data at dimension $j$, $\theta$ is a threshold, 
$\theta \in [\min(x^{j}), \max(x^{j})]$, and $p \in \{-1, 1\}$ is a polarity indicating 
the direction of the step function. In other words, we generate a 
classifier which partitions the input space into two sets: 
$p x^{j} < p\theta$ and $p x^{j} \geq p\theta$. Assuming that each of the $M$ 
features takes $N$ distinct values over the training set, the maximum 
number of dimensions of the new feature space is $MN$. For 
vision tasks, e.g., face and car detection, the number of training samples in 
each node can be close to 10,000 and the number of Haar-like 
features can be in the order of $10^{5}$, so there would be 
more than one billion binary features to consider. Searching 
all possible subsets of features is computationally 
infeasible. Boosting can be used to select features in this 
extremely high dimension. 
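
The transformation in (3) can be sketched in a few lines of Python. The helper names `stump_output` and `candidate_stumps` are hypothetical, and treating sign(0) as +1 is our assumption. Note that enumerating both polarities gives 2N candidates per feature, whereas the MN count in the text counts thresholds only.

```python
def stump_output(x_j, theta, p):
    """h(x^j) = sign(p * (x^j - theta)); sign(0) is taken as +1 (an assumption)."""
    return 1 if p * (x_j - theta) >= 0 else -1

def candidate_stumps(feature_column):
    """One threshold per distinct feature value, each with two polarities."""
    thresholds = sorted(set(feature_column))
    return [(theta, p) for theta in thresholds for p in (-1, 1)]

column = [0.2, 0.7, 0.4, 0.9]  # one feature observed over N = 4 samples
stumps = candidate_stumps(column)
# Each stump turns the real-valued column into one binary (+1/-1) feature.
binary_features = [[stump_output(x, theta, p) for x in column]
                   for theta, p in stumps]

# Size estimate from the text: N = 10,000 samples and M = 10^5 features.
print(len(stumps), 10_000 * 10**5)  # 8 1000000000
```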

Boosting is a well-known machine learning algorithm, 
which builds an additive model from a set of weak learners 
[23]. There are many variants of boosting, e.g., AdaBoost [10], 
LogitBoost [24], BrownBoost [25], LPBoost [26], etc. The 
algorithm trains a series of weak learners with updated sample 
weights. The weak learning algorithm is designed to select 
the single feature which best separates positive and negative 
examples. For each feature, the weak learner determines the 
optimal classification function, such that a minimum number 
of examples are misclassified. A weak classifier consists of 
a feature, $f$, a threshold, $\theta$, and a polarity, $p$, indicating the 
direction of the inequality: 



$$h(x, f, p, \theta) = \begin{cases} +1 & \text{if } p f(x) < p\theta, \\ -1 & \text{otherwise,} \end{cases} \tag{4}$$

where $x$ is a training example. The final decision rule is 
formed by linearly combining the set of hypotheses (weak 
learners) generated at each round with their weighted votes. 
The final prediction can be written as 

$$F(x) = \mathrm{sign}\Big(\sum_{t=1}^{T} \alpha_t h_t(x)\Big), \tag{5}$$

where $h_t$ is the $t$-th weak learner at iteration $t$ and $\alpha_t$ is the 
$t$-th coefficient computed by the boosting procedure. 
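
To make the boosting loop concrete, here is a minimal sketch of discrete AdaBoost with decision stumps in pure Python. It is illustrative only: the stump search is a brute-force scan rather than the sorted O(MN log N) search used in practice, the stump follows the sign convention of (3), and the error clipping and the toy data are our assumptions.

```python
import math

def stump(x_j, theta, p):
    """Decision stump sign(p * (x_j - theta)); sign(0) is taken as +1."""
    return 1 if p * (x_j - theta) >= 0 else -1

def train_stump(X, y, w):
    """Exhaustively pick (error, feature, threshold, polarity) with minimal weighted error."""
    best = None
    for j in range(len(X[0])):
        for theta in sorted({row[j] for row in X}):
            for p in (-1, 1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump(row[j], theta, p) != yi)
                if best is None or err < best[0]:
                    best = (err, j, theta, p)
    return best

def adaboost(X, y, rounds):
    """Discrete AdaBoost; returns a list of (alpha_t, feature, threshold, polarity)."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, j, theta, p = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, theta, p))
        # Misclassified samples gain weight, correctly classified ones lose it.
        w = [wi * math.exp(-alpha * yi * stump(row[j], theta, p))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """F(x) = sign(sum_t alpha_t h_t(x)), as in (5)."""
    score = sum(alpha * stump(x[j], theta, p) for alpha, j, theta, p in ensemble)
    return 1 if score >= 0 else -1

X = [[0.1, 1.0], [0.2, 0.0], [0.8, 1.0], [0.9, 0.0]]  # toy data: feature 0 separates the classes
y = [-1, -1, 1, 1]
ensemble = adaboost(X, y, rounds=3)
predictions = [predict(ensemble, x) for x in X]
```

The sample weights `w` are what lets each round account for previously selected weak classifiers without revisiting them.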

One of the key advantages of applying boosting as a feature 
selection mechanism is the speed of learning. In traditional 
feature selection, the algorithm would need to evaluate $MN$ 
binary weak classifiers (assuming feature values are distinct). 
On the other hand, boosting makes use of sample weights 
to compactly encode the dependency of previously selected 
features. These weights can then be used to evaluate a given 
weak classifier in a constant time. By applying boosting, we 
are able to efficiently select a subset of features, which are 
most discriminative for classification, from a very large pool 
of features. 

Boosting with decision stumps as weak classifiers combines 
two tasks simultaneously when training a classifier: selecting 
the subset of features and building the symmetric ensemble 
classifier. For training the cascade classifier with the asymmet- 
ric learning objective (e.g., 99% detection rate and 50% false 



positive rate), separating these two processes provides more 
flexibility. In the next section, we briefly review the concept 
of the linear asymmetric classifier (LAC), which has been shown 
to be a better alternative in learning an ensemble classifier for 
the cascade framework. 

B. Post-processing with LAC 

Wu et al. proposed LAC as a post-processing step for 
training nodes in the cascade framework [7]. They post-trained 
a weighted vote of AdaBoost's weak classifiers using the 
asymmetric criterion. In their work, one of the conclusions is 
that LAC is guaranteed to reach an optimal solution in terms of 
the node learning goal under the assumption of Gaussian data 
distribution. In this section, we briefly review their approach, 
which motivates our proposed algorithm. 

Consider a linear classifier $f(x) = \mathrm{sign}(w^{\top} x - b)$. The 
objective of each node in cascade classifiers is to seek a $\{w, b\}$ 
pair which has a very high accuracy on the positive data, $x_1$, 
and moderate accuracy on the negative data, $x_2$. This objective 
can be expressed as the following optimization problem: 

$$\max_{w, b} \; \Pr\{w^{\top} x_1 \geq b\}, \quad \text{subject to} \quad \Pr\{w^{\top} x_2 \leq b\} = 0.5. \tag{6}$$

They made the following two assumptions to solve (6): a) 
$w^{\top} x_1$ is Gaussian; b) $w^{\top} x_2$ is symmetric. Under these 
two assumptions, their objective function can be simplified to 

$$\max_{w \neq 0} \; \frac{w^{\top}(\mu_1 - \mu_2)}{\sqrt{w^{\top} \Sigma_1 w}}, \tag{7}$$

where $\mu_1$ and $\mu_2$ are the means of the positive and negative classes, 
respectively, and $\Sigma_1$ is the covariance matrix of the positive class. 
The form of (7) is similar to that of LDA, which can be written 
as 

$$\max_{w \neq 0} \; \frac{w^{\top}(\mu_1 - \mu_2)}{\sqrt{w^{\top}(\Sigma_1 + \Sigma_2) w}}, \tag{8}$$

where $\Sigma_2$ is the covariance matrix of the negative class. The 
only difference between LAC and LDA is that the positive-class 
covariance matrix, $\Sigma_1$, in LAC is replaced by the pooled covariance 
$\Sigma_1 + \Sigma_2$ in LDA. (7) and (8) can 
be solved by eigen-decomposition, and closed-form solutions 
for (7) and (8) can be derived as 

$$w_{\mathrm{LAC}} = \Sigma_1^{-1}(\mu_1 - \mu_2), \tag{9}$$

and 

$$w_{\mathrm{LDA}} = (\Sigma_1 + \Sigma_2)^{-1}(\mu_1 - \mu_2), \tag{10}$$

respectively. It is important to note here that the positive data, 
$x_1$, and the negative data, $x_2$, are simply the outputs of weak 
classifiers. Hence, the solution expressed in (9) can be used 
as a replacement for boosting coefficients and node rejection 
threshold. 
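
The closed forms (9) and (10) are straightforward to evaluate numerically. The sketch below uses NumPy on synthetic data. The function names, the small ridge term added for numerical stability, the Gaussian toy data (real weak-classifier outputs would be binary), and the use of the median of the negative scores as the threshold $b$ are all our assumptions for illustration.

```python
import numpy as np

def lac_lda_weights(H_pos, H_neg, ridge=1e-6):
    """Closed-form LAC (9) and LDA (10) weights from weak-classifier outputs.

    H_pos, H_neg: arrays of shape (num_samples, num_weak_classifiers).
    """
    mu1, mu2 = H_pos.mean(axis=0), H_neg.mean(axis=0)
    sigma1 = np.cov(H_pos, rowvar=False)
    sigma2 = np.cov(H_neg, rowvar=False)
    reg = ridge * np.eye(mu1.size)  # keeps the linear systems well conditioned
    w_lac = np.linalg.solve(sigma1 + reg, mu1 - mu2)
    w_lda = np.linalg.solve(sigma1 + sigma2 + reg, mu1 - mu2)
    return w_lac, w_lda

def node_threshold(w, H_neg):
    """Pick b so that about half of the negative samples are rejected."""
    return float(np.median(H_neg @ w))

rng = np.random.default_rng(0)
H_pos = rng.normal(1.0, 0.5, size=(500, 3))  # synthetic stand-in for x_1
H_neg = rng.normal(0.0, 1.0, size=(500, 3))  # synthetic stand-in for x_2
w_lac, w_lda = lac_lda_weights(H_pos, H_neg)
b = node_threshold(w_lac, H_neg)
detection_rate = float(np.mean(H_pos @ w_lac >= b))
false_positive_rate = float(np.mean(H_neg @ w_lac >= b))
print(detection_rate, false_positive_rate)
```

Fixing the false positive rate at 50% first and then reading off the detection rate mirrors the constraint and objective in (6).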

Nonetheless, LAC has several drawbacks. First, it relies on 
a limited set of features trained by AdaBoost. Shen et al. illustrate 
that when training data is highly asymmetric, AdaBoost 
can select a set of irrelevant features to re-initialize sample 
weights [12]. When this happens, LAC can only suppress 
weights of irrelevant features by setting their coefficients to 
be small. Second, LDA is shown to perform as well as LAC 
in object detection [7], [22]. In the next section, we present an 
efficient approach to prune irrelevant weak learners based on 
GSLDA/sparse eigenvectors. The algorithm integrates sparsity 
into the LDA classifier so that we only keep an optimal set of 
weak learners while achieving the asymmetric node learning 
goal in cascade classifiers. 

C. An Efficient Pruning Algorithm Based on Sparse Eigenvectors 

In the previous section, we discussed LAC and LDA, which 
have been empirically shown to be better at handling the asymmetric 
node learning goal in cascade classifiers. In this section, we 
propose a more efficient feature search which finds a reduced 
set of hypotheses while satisfying (6). 

Pruning can be cast as a sparse representation problem. 
Sparse representation attempts to find a solution which uses 
only a small subset of original features. Sparse representation 
has been successfully applied to solve problems in many 
areas, e.g., signal compression [27], image de-noising [28], 
variable selection [29], face recognition [30], learning face 
features [31], etc. Motivated by the important role of sparsity, 
we prune the selected set of weak classifiers by training a 
sparse algorithm. Our objective is to obtain a coefficient vector, 
$w$, with many zero elements, indicating that only a few weak 
classifiers actually participate in the final decision rule. In this 
section, we propose to apply a sparse LDA to remove a set 
of irrelevant features. Similar to LDA, the sparse LDA solves 
the maximal class-separation problem but with an additional 
sparsity constraint. The objective function of sparse LDA can 
be written as 

$$\max_{w} \; \frac{w^{\top} S_b w}{w^{\top} S_w w}, \quad \text{subject to} \quad \mathrm{Card}(w) = k, \tag{11}$$

where $S_b$ and $S_w$ correspond to the between-class and within-class 
covariance matrices, respectively. $\mathrm{Card}(\cdot)$ counts the 
number of non-zero components and $k$ is an integer set by 
the user. In our paper, we define the within-class covariance 
matrix as 

$$S_w = \gamma \Sigma_1 + (1 - \gamma) \Sigma_2, \tag{12}$$

where $\Sigma_1$ and $\Sigma_2$ are covariance matrices of the positive and 
negative classes, respectively. The parameter, $\gamma$, controls the 
weighted sum of covariance matrices between both classes. 
By setting $\gamma$ to 0.5, we have the LDA objective (8) and 
by setting $\gamma$ to 1, we have the LAC objective (7). In the 
experimental section, we conjecture that LDA is simply a 
regularized version of LAC. 

Due to the sparsity constraint in (11), a closed-form solution 
to LDA, (10), can no longer be used. Problem (11) can be solved 
using a branch-and-bound search to select a set of relevant 
features |p^. The algorithm finds an exact solution to the 
sparse problem. However, the algorithm is computationally 
expensive and is almost infeasible on large feature dimensions. 
Even with a good initialization, the branch-and-bound search 
takes more than two hours to solve a problem where the size 



of the original feature space is 40 and the number of non-zero 
components, $k$, is set to 20 [16]. In face detection, the 
number of Haar-like features can be more than 100,000 and 
$k$ can be as large as 200. Clearly, there is a need for a more 
efficient alternative solution. Two widely adopted approaches 
to approximately solve the optimization problem with the 
sparsity constraint as in (11) are forward and backward 
greedy algorithms. A forward algorithm sequentially adds a new 
feature/variable at each step to improve the objective function. 
Forward selection has been commonly applied due to its 
effectiveness and efficiency. A shortcoming of forward selection is 
that the algorithm can never correct mistakes made in earlier 
steps. In order to remedy this situation, a backward greedy 
algorithm has been adopted. The idea is to train a full model 
and greedily remove one feature/variable at a time. In this 
paper, we adopt an efficient greedy approach proposed in fTTj . 
For our problem, the computation can be made very efficient 
as the objective of (11) can be computed in a closed form as 
$b^{\top} S_w^{-1} b$, because $S_b$ is a rank-1 matrix given by a simple 
outer product, $S_b = b b^{\top}$, where the vector $b$ is proportional to 
$\mu_1 - \mu_2$. Therefore, the computational complexity 
is dominated by $S_w^{-1}$. A naive matrix inversion would 
be computationally expensive and inefficient. Since the matrix 
$S_w$ is sequentially appended or reduced by a single row and 
column, an efficient matrix inversion update algorithm can be 
exploited p2| . 

Let $A^{(t)}$ be a square symmetric matrix of size $t \times t$ and 
assume that we have computed its inverse, $(A^{(t)})^{-1}$. If a 
vector, $v \in \mathbb{R}^{t+1}$, is appended to $A^{(t)}$ such that 

$$A^{(t+1)} = \begin{bmatrix} A^{(t)} & v_{(1:t)} \\ v_{(1:t)}^{\top} & v_{(t+1)} \end{bmatrix},$$

the new augmented inverse $(A^{(t+1)})^{-1}$ can 
be calculated efficiently from 

$$(A^{(t+1)})^{-1} = \begin{bmatrix} (A^{(t)})^{-1} + a\,u u^{\top} & -a u \\ -a u^{\top} & a \end{bmatrix}, \tag{13}$$

where $u = (A^{(t)})^{-1} v_{(1:t)}$ and $a = 1 / \big(v_{(t+1)} - v_{(1:t)}^{\top} u\big)$. 

Similarly, for backward greedy elimination, the new matrix 
inverse can be calculated by a simple rank-1 update 
as 

$$(A^{(t-1)})^{-1} = B - \big(s_{(1:t-1)}\, s_{(1:t-1)}^{\top}\big) / s_{(t)}, \tag{14}$$

where we partition the matrix inverse $(A^{(t)})^{-1}$ as follows: 

$$(A^{(t)})^{-1} = \begin{bmatrix} B & s_{(1:t-1)} \\ s_{(1:t-1)}^{\top} & s_{(t)} \end{bmatrix}.$$

Here we assume that we 
want to remove the last row and column of the matrix. Note 
that one would need to permute the rows and columns of the 
matrix if this is not the case. For a forward search, the greedy 
algorithm sequentially finds the suboptimal $w$ by adding a 
new variable which yields the maximal eigenvalue, $b^{\top} S_w^{-1} b$. 
On the other hand, for a backward elimination, the algorithm 
finds the suboptimal $w$ by sequentially discarding a variable 
which yields the minimal eigenvalue. The algorithm continues 
until the predefined number of elements are selected, hence 
the name greedy sparse LDA (GSLDA). GSLDA compares 
favorably with other sparse algorithms in terms of 
effectiveness and efficiency, as shown in [12], [16]. 
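
The two inverse updates can be checked numerically. The following NumPy sketch implements the append update (13) and the removal update (14) and compares both against a direct inversion. The function names are ours, and the random symmetric positive definite test matrix is an assumption made only for the check.

```python
import numpy as np

def append_inverse(A_inv, v):
    """Inverse of [[A, v[:t]], [v[:t]^T, v[t]]] given A_inv = A^{-1}, following (13)."""
    t = A_inv.shape[0]
    u = A_inv @ v[:t]
    a = 1.0 / (v[t] - v[:t] @ u)
    top_left = A_inv + a * np.outer(u, u)
    return np.block([[top_left, -a * u[:, None]],
                     [-a * u[None, :], np.array([[a]])]])

def remove_last_inverse(A_inv):
    """Inverse of A with its last row and column removed, following (14)."""
    B, s, s_t = A_inv[:-1, :-1], A_inv[:-1, -1], A_inv[-1, -1]
    return B - np.outer(s, s) / s_t

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
A = X.T @ X + np.eye(5)                  # symmetric positive definite, 5 x 5
inv4 = np.linalg.inv(A[:4, :4])          # inverse of the leading 4 x 4 block

grown = append_inverse(inv4, A[:, 4])    # append the 5th row and column
shrunk = remove_last_inverse(np.linalg.inv(A))  # drop them again

print(np.allclose(grown, np.linalg.inv(A)), np.allclose(shrunk, inv4))
```

Each update costs a matrix-vector product and an outer product, i.e., quadratic rather than cubic time in the current matrix size.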

We illustrate the flowchart of our approach in Fig. 2. Our 
algorithm works as follows. We first train a pool of discrimi- 
native features using AdaBoost. We then prune selected weak 
[Fig. 2 flowchart. AdaBoost-based forward feature selection: for 
one-dimensional features (e.g., Haar-like features), fast feature 
computation, training decision stumps, and adding the best weak 
classifier; for multi-dimensional features (e.g., HOG features), an 
additional step projects the features to a line before the decision 
stumps are trained. Feature pruning (GSLDA): initial calculation of 
sample means and covariance matrix inversion, updating them for the 
$T_1$ features, and removing the least relevant weak classifier from 
the current set. The output is a classifier with a biased learning 
goal (false positive rate of 50%).] 
TABLE II 
Computational and memory complexity of our pruning algorithm. 

Step                              Time                                          Memory 
Feature acquisition (Haar-like)   $O(T_1 M)$                                    $O(ND)$ 
Feature acquisition (HOG)         $O(T_1 M L^3)$                                $O(NLD)$ 
AdaBoost                          $O(T_1 M N \log N)$                           $O(N)$ 
Pruning                           $O(T_2 T_1^3)$                                $O(T_1^2)$ 
Total (Haar-like features)        $O(T_1 M + T_1 M N \log N + T_2 T_1^3)$       $O(ND + N + T_1^2)$ 
Total (HOG features)              $O(T_1 M L^3 + T_1 M N \log N + T_2 T_1^3)$   $O(NLD + N + T_1^2)$ 



Note that [12] also used GSLDA to perform feature selection 
on a set of Haar-like features. However, the set of stump 
features used in their work is much smaller than ours. 
In their work, the number of stump features is equal to the 
number of Haar-like features, i.e., stump features are generated 
by training a decision stump on each Haar-like feature with 
uniform sample weights. In our approach, AdaBoost searches 
the entire stump space and keeps a set of potential stump features. 
Hence, our feature space is much larger than the 
one they adopted. Compared to [12], our approach is not only 
more discriminative but also faster to train. In our experiments, 
we train 200 to 500 weak classifiers using AdaBoost. These 
selected weak classifiers ensure an over-complete and optimal 
set of candidates for the asymmetric node learning goal. 



Fig. 2. Flowchart of the proposed pruning algorithm. 



Algorithm 1 The pruning algorithm.

Input:
  • A set of examples $\{x_i, y_i\}$, $i = 1, \dots, N$;
  • The number of initial weak classifiers to be trained, $T_1$;
  • The maximum number of weak classifiers for the given node, $T$;

Output:
  • An ensemble classifier, $F(x) = \mathrm{sign}(\sum_{j=1}^{T} w_j h_j(x) - b)$, that
    best satisfies the asymmetric learning objective (6);

Initialize:
  • $t \leftarrow 0$;
  • Initialize sample weights;

while $t < T_1$ (selecting weak classifiers using AdaBoost) do
  1. Train a weak learner (e.g., a decision stump (4) that results in
     the smallest misclassification error) on the training set;
  2. Add the best weak learner into the current set;
  3. Update sample weights based on AdaBoost;
  4. $t \leftarrow t + 1$;

while $t > T$ (pruning using GSLDA) do
  1. Remove the weak classifier that least satisfies the asymmetric
     node learning goal (6), using the GSLDA algorithm;
  2. $t \leftarrow t - 1$;

Adjust the threshold value $b$ such that $F$ has a 50% false positive rate
on the training set.



classifiers by performing a sequential backward elimination.
The features that least satisfy the objective criterion
(6) are removed from our feature set. Backward elimination
continues until the required ensemble size is reached
or the predefined node learning goal is achieved, e.g., 99%
detection rate and 50% false positive rate. Finally, we adjust
the threshold of the node classifier such that it has a 50%
false positive rate on the training set. The final classifier
has a similar form as AdaBoost (5). We summarize our
pruning approach in Algorithm 1.
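
The final threshold adjustment can be illustrated with a minimal Python sketch. It assumes that the ensemble scores $\sum_j w_j h_j(x)$ of the negative training samples have already been computed; the function names are hypothetical, and tied scores are not treated specially (with ties, slightly fewer negatives than requested may pass).

```python
def threshold_for_false_positive_rate(negative_scores, target_fpr=0.5):
    """Pick b so that roughly target_fpr of the negatives satisfy score > b.

    Hypothetical helper: the scores are the ensemble outputs on negative
    training samples, computed before the threshold b is subtracted.
    """
    ordered = sorted(negative_scores, reverse=True)
    n_allowed = int(target_fpr * len(ordered))  # negatives allowed to pass
    if n_allowed >= len(ordered):
        return ordered[-1] - 1.0  # every negative passes
    # Use the score of the first negative that must be rejected as b.
    return ordered[n_allowed]


def false_positive_rate(negative_scores, b):
    """Fraction of negatives classified as positive, i.e. with score > b."""
    return sum(s > b for s in negative_scores) / len(negative_scores)
```

For six negatives with distinct scores, the returned threshold lets exactly the three highest-scoring negatives pass, which corresponds to a 50% false positive rate on that set.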



D. Time and Memory Complexity 

To analyze the complexity, we break our approach into
three stages: feature acquisition, AdaBoost and pruning. We
first analyze the complexity of acquiring low-level features,
e.g., Haar-like features [1] or histogram of oriented gradients
(HOG) features [33]. One may also use features like covariance
features [34], [35] or CENTRIST [36]. In this step, we
pick features which can be computed in linear time with
the use of integral images [7] or integral histograms [37].
Hence, this step costs $O(M)$ time for one-dimensional features
and $O(ML)$ time for multi-dimensional features. For
multi-dimensional features, we project computed features onto a line
using Fisher Linear Discriminant Analysis (LDA). LDA can
be efficiently solved by generalized eigenvalue decomposition.
This additional step costs $O(ML^3)$, where $L$ is the number of
dimensions (the total number of histogram bins for a block
of HOGs). Hence, the feature computation step takes $O(M)$
time for one-dimensional features and $O(ML^3)$ time for
multi-dimensional features (since $O(ML + ML^3) = O(ML^3)$).
For memory complexity, we need to store integral images
for each training sample. For one-dimensional features, each
training sample has a memory complexity of $O(D)$ and the
total memory complexity for $N$ training samples is $O(ND)$.
For multi-dimensional features, each training sample has a
memory complexity of $O(LD)$ and the total memory
complexity is $O(NLD)$.

In the next step, we train weak classifiers known as decision
stumps (4). To train decision stumps, we have to find the
optimal threshold, which produces the minimal
misclassification error. For fast training of decision stumps, one
can sort feature values and scan through all possible threshold
values sequentially to update the error rate of decision stumps
[7]. This algorithm takes $O(N \log N)$ for sorting and $O(N)$
for scanning. We ignore the $O(N)$ term since $O(N \log N)$
dominates it. In each AdaBoost iteration, we need to train $M$
decision stumps. Hence, this step takes $O(MN \log N)$. Training
$T_1$ iterations of AdaBoost would take $O(T_1 MN \log N)$.
For memory complexity, we need to store sorted feature
values, $O(N)$, and the output ensemble classifiers, $O(T_1)$.
Hence, the total memory complexity is $O(N + T_1)$. In object
detection, we often have $N \gg T_1$, which means AdaBoost
requires approximately $O(N)$ memory. In summary, the entire
AdaBoost training has a complexity of $O(T_1 MN \log N)$ time
and $O(N)$ memory.
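
The sort-and-scan procedure for a single feature can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in the experiments: the function name, the $\pm 1$ label convention and the polarity term are assumptions made for the example.

```python
def train_stump(values, labels, weights):
    """Weighted decision stump on one feature by sorting, then scanning.

    labels are +1 or -1. Returns (error, threshold, polarity); the stump
    predicts +1 when polarity * (value - threshold) > 0, and -1 otherwise.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])  # O(N log N)
    total_pos = sum(w for w, y in zip(weights, labels) if y > 0)
    total_neg = sum(w for w, y in zip(weights, labels) if y <= 0)

    # Start with the threshold below every sample.
    low = values[order[0]] - 1.0
    best = min((total_neg, low, 1), (total_pos, low, -1))

    pos_below = neg_below = 0.0  # weight at or below the current threshold
    for k, i in enumerate(order):  # O(N) scan with running error updates
        if labels[i] > 0:
            pos_below += weights[i]
        else:
            neg_below += weights[i]
        nxt = values[order[k + 1]] if k + 1 < len(order) else None
        if nxt == values[i]:
            continue  # never place a threshold between equal values
        thr = values[i] + 1.0 if nxt is None else 0.5 * (values[i] + nxt)
        err_plus = pos_below + (total_neg - neg_below)   # polarity +1
        err_minus = neg_below + (total_pos - pos_below)  # polarity -1
        best = min(best, (err_plus, thr, 1), (err_minus, thr, -1))
    return best
```

Running this once per feature gives the $O(MN \log N)$ cost of one AdaBoost iteration; only the two running sums change between consecutive candidate thresholds, which is why the scan itself is linear.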

In the final step, we prune weak classifiers obtained from
AdaBoost using backward elimination. Backward elimination
starts with the full index set ($T_1$) and sequentially deletes
the variable which is the least relevant until only $T$ elements
remain. To begin a pruning operation, we first compute sample
means and covariances for both positive and negative samples.
We then calculate the inverse of covariance matrices. This
process requires $O(T_1^3)$. Next, we sequentially remove less
discriminative features from the current set. Each backward
elimination step, which updates the matrix inverse and removes
the least relevant feature from the current set, costs at most
$O(T_1^3)$. Since we perform this step $T_2$ times,¹ the total
computational cost of backward search is $O(T_2 T_1^3)$ time.
For memory complexity, we need
to store both sample means, $O(T_1)$, and sample covariances,
$O(T_1^2)$. The total memory complexity is $O(T_1^2)$. Hence,
pruning has a total complexity of $O(T_2 T_1^3)$
time and $O(T_1^2)$ memory. It is important to point out here that
$M \gg T_1$ and $N \gg T_1$ in vision tasks. Since neither $M$ nor
$N$ appears in the computational complexity of pruning, pruning does
not take up the majority of the training time. In the experimental
section, we show that most of the training time is actually spent
on bootstrapping difficult negative samples in later cascade
nodes. Table II summarizes the complexity in terms of time
and memory of our approach.
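
To give a feeling for the relative size of the two dominant terms, the following Python sketch plugs illustrative values into the asymptotic expressions for AdaBoost, $O(T_1 MN \log N)$, and pruning, $O(T_2 T_1^3)$. The values $T_1 = 500$, $M = 16{,}233$ and $N = 10{,}000$ are borrowed from the MIT-CMU experiment reported in the experimental section, and $T = 200$ is the largest node size used there. Constant factors are ignored and a base-2 logarithm is assumed, so the result is only an order-of-magnitude indication, not a runtime prediction.

```python
import math


def adaboost_ops(t1, m, n):
    """Asymptotic operation count T1 * M * N * log2(N), constants dropped."""
    return t1 * m * n * math.log2(n)


def pruning_ops(t1, t):
    """Asymptotic operation count T2 * T1^3 with T2 = T1 - T, constants dropped."""
    t2 = t1 - t
    return t2 * t1 ** 3


# Illustrative values (assumptions for this example, see the text above).
T1, T, M, N = 500, 200, 16233, 10000
ada = adaboost_ops(T1, M, N)
prune = pruning_ops(T1, T)
print(f"AdaBoost term: {ada:.2e}, pruning term: {prune:.2e}")
print(f"ratio: {ada / prune:.1f}")
```

Under these assumptions the AdaBoost term is larger than the pruning term by more than an order of magnitude, in line with the remark that pruning does not dominate training time.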

IV. Experiments 

A. Face Detection on a Single-node Detector 

The aim of our first experiment is to emphasize the
performance difference between AdaBoost+LDA [7] and our
pruning approach. In this experiment, we set the value of $\gamma$ to
0.5 (LDA). We use the FDDB face detection benchmark data
set [38]. The data set consists of 5,171 faces in 2,845 images
under varying conditions in unconstrained environments (faces
in the wild). Here we use HOG features to train the face
detector. We discard face samples which have a resolution
less than 48 × 48 pixels. The remaining faces are scaled to 48 × 48
pixels with an additional 8-pixel border to preserve the contour
information of faces. The 9,300 remaining faces are split into two
sets. The first set contains 5,000 faces and 5,000 random
non-faces for training. The second set contains 4,300 faces and
100,000 random non-faces for evaluation. In our experiment,
we use blocks with different scales (8 × 8 pixels to 48 × 48
pixels) and various aspect ratios (1:1, 1:2, 2:1, 1:3

¹ $T_2$ is the number of features to be removed during pruning, i.e., $T_2 =
T_1 - T$, where $T_1$ is the number of initial features to be selected and $T$ is
the number of features in the node classifier.



[Fig. 3 plots. Top: "Performance of single-node detectors" (x-axis: number of weak classifiers). Bottom: "Coefficients of final learned weak classifiers" (x-axis: index of weak classifiers).]

Fig. 3. Detection performance (top) of a single-node classifier. Here
"Pruning" is our approach and "LDA" is the AdaBoost+LDA approach [7]. We
measure the detection rate by setting the false positive rate on the test set
to 50%. Our approach has a higher detection rate than AdaBoost+LDA of
Wu et al. The bottom plot shows values of weak classifiers' coefficients for
both methods. Our pruning approach selects very different weak classifiers
compared with Wu et al.









Fig. 4. Illustration of selected and discarded HOG blocks using the 
asymmetric node learning criterion. Top: Kept HOG blocks; Middle: Removed 
HOG blocks by pruning from the first 70 weak learners selected by AdaBoost; 
Bottom: HOG blocks that are kept by our pruning approach but not in the 
first 70 weak learners selected by AdaBoost. 



and 3:1). Each block is divided into 2 × 2 cells and the
HOG in each cell is summarized into 9 bins [33]. Hence,
36-dimensional features are generated for each block. An
$\ell_1$-sqrt normalization is applied to the feature vector. At each
iteration, we randomly sample 25% of all possible blocks for
training a weak classifier. For our approach, we use AdaBoost
to select 500 weak classifiers in the first step, which are assumed
to contain the most discriminative weak classifiers for this task.
Fig. 3 shows the detection rate of both algorithms by fixing the






false positive rate to 50%. In other words, each algorithm is
set to reject 50,000 of the 100,000 non-faces. From the figure,
our approach has a higher detection rate than
AdaBoost+LDA [7]. Our algorithm also uses a smaller number
of weak classifiers. For example, to achieve a 99.4% detection
rate and 50% false positive rate on test sets, our algorithm
requires 40 weak classifiers (pruned from 500 weak classifiers)
while AdaBoost+LDA needs at least 70 weak classifiers. Fig. 3
also shows the value of weak classifiers' coefficients, $w$, of
both algorithms. The x-axis represents the index of all 500
candidate weak classifiers. The y-axis represents weights of
weak classifiers. It can be observed that our approach selects
very different weak classifiers compared with AdaBoost+LDA.

We also illustrate HOG blocks that are selected and
discarded using the proposed approach. Fig. 4 (top) shows five
HOG blocks with the highest weak learners' coefficients which
are kept by our approach (in the first 70 weak learners selected
by AdaBoost). The middle row shows five less relevant HOG
blocks (with the lowest weak learners' coefficients) which are
removed by pruning from the set of the first 70 weak
learners selected by AdaBoost. The bottom row shows five HOG blocks
kept by the pruning approach from the pool of 500 weak
learners but not in the first 70 selected by AdaBoost.

We observe that blocks which cover the lower part of the
face, i.e., areas around the cheeks, are often removed. On the other
hand, HOG blocks located around the eyes and nose are more
likely to be kept or selected by our pruning approach. Our
pruning approach thus has more flexibility in choosing a set
of discriminative weak classifiers than AdaBoost+LDA. As
demonstrated in this experiment, the first 70 weak classifiers
selected by AdaBoost are not necessarily the optimal set that
meets the asymmetric node learning goal. Consequently, it also
justifies the need to use AdaBoost to select a large
over-complete set of weak classifiers.

B. Face Detection on MIT-CMU 

In this experiment, we first evaluate our approach on face
detection with different values of $\gamma$ (12). In this experiment,
we set $\gamma \in \{0.25, 0.5, 0.75, 1.0\}$. Our training set contains
5,000 face patches and 5,000 initial non-face patches. The
resolution of the training data is 24 × 24 pixels. Negative
patches are collected from 10,000 background images. We
use 16,233 features sampled uniformly from the entire set
of Haar-like features. We train 20 node classifiers. The number
of weak classifiers in each node is 7, 15, 30, 30, 50, 50,
50, 100, 120, 140, 160, 180, 200, 200, 200. In all
nodes, we adjust the threshold such that each node achieves a
50% false positive rate on the training data. For our pruning
approach, we set $T_1$ to 500. In other words, we collect
bootstrapped non-face patches and train 500 weak classifiers
using AdaBoost in each node. Note here that it is possible
to set $T_1$ to be larger. Doing so would make it more likely that the
initial set of features is over-complete. However, this also
increases the overall computational time during training. Our
cascade training algorithm terminates when the bootstrapping
non-face image database is depleted. During evaluation, the
test image is re-scaled repeatedly by a factor of 1.25 and


[Fig. 5 plots: "Performance on MIT-CMU test sets".]

Fig. 5. Performance comparison on MIT+CMU face data sets. We evaluate
our approach with different pruning parameters (top) and compare against
several face detectors (bottom).



scanned with a stride of 1 pixel in both directions.
Post-processing similar to [1] is applied to merge final detection
results. We construct receiver operating characteristic (ROC)
curves by repeatedly removing nodes from the cascade to
generate points with increasing detection and false positive
rates. Results are shown in Fig. 5. Based on the ROC curves,
$\gamma \in [0.5, 1.0]$ perform similarly. We suspect that the LDA
criterion ($\gamma = 0.5$) performs similarly to the LAC criterion
($\gamma = 1$) since LDA is simply a regularized version of LAC.
In other words, if we assume that the covariance matrix of
the negative data is approximately $\nu \mathbf{I}$, i.e., $\Sigma_2 \approx \nu \mathbf{I}$, where $\nu$
is a positive constant and $\mathbf{I}$ is an identity matrix, (12) can be
written as



$$\Sigma_1 + \frac{1 - \gamma}{\gamma}\,\Sigma_2 = \Sigma_1 + \nu \mathbf{I}. \qquad (15)$$



In our experiments, we observe this assumption to be valid for
the asymmetric node learning objective, i.e., the off-diagonal
elements of $\Sigma_2$ (correlation values) are often close to zero. Our
conjecture is that negative data in each node are bootstrapped
from a large pool of background images and are likely to follow
a uniform distribution. Hence, LDA is simply a regularized
LAC. It can be argued that the distribution of negative data
may not follow the uniform distribution in later stages where a
large number of background patches have already been filtered
out. In this case, we conjecture that it may be best to use
covariance information from both positive and negative data
($\gamma < 1$). This might explain why Wu et al. also observed that,
in some cases, LDA gives a better performance than LAC.
[Fig. 6 plot: "Node Performance on the Face Validation Set" (x-axis: node number; legend: AdaBoost, AdaBoost + LDA, Pruning).]

Fig. 6. Independent node performance on validation sets for face detection. In most nodes, our approach results in a smaller false negative rate.



TABLE III

COMPARISON OF TRAINING TIME AND THE AVERAGE NUMBER OF
FEATURES EVALUATED PER PATCH FOR DIFFERENT DETECTORS. THE
AVERAGE NUMBER OF FEATURES USED WAS OBTAINED FROM TEST SETS
INDICATED.

                      MIT-CMU (Haar)             FDDB (HOG)
Algorithm             Train.    # Evaluation     Train.    # Evaluation
Pruning               6h10m     16.31            4h30m     15.07
AdaBoost [1]          5h15m     13.75            3h30m     13.89
AdaBoost+LDA [7]      5h25m     13.09            3h40m     13.88



In the rest of our experiments, we set $\gamma$ to 0.5 (LDA).

In the next experiment, we compare our approach with
AdaBoost [1], AdaBoost with the LDA post-processing
(AdaBoost+LDA) [7], AdaBoost with a backward elimination
(FloatBoost) [21] and the GSLDA algorithm [12]. We compare
the performance of five object detectors on the MIT+CMU test
sets. All experimental settings remain the same as described
previously, i.e., all five cascade classifiers are trained with
the same number of Haar-like features and the node learning
goal of all five detectors is set to be the same. ROC curves
are plotted in Fig. 5. Note here that the experimental results
of other detectors are based on our own re-implementation.
Experimental results demonstrate that pruning AdaBoost with
the LDA criterion performs best. It outperforms the GSLDA
object detector as it incorporates more relevant features. Our
proposed classifier also outperforms FloatBoost as FloatBoost
does not consider the asymmetry in the learning.

Fig. 6 shows an independent node comparison on 4,832
test faces with different learning algorithms. We evaluate each
node independently. From the figure, pruning has a smaller
false negative rate (higher detection rate) than AdaBoost and
AdaBoost+LDA in most nodes. In summary, our results
demonstrate the superior performance of pruning. Table III
shows the approximate cascade training time and the average
number of features evaluated per patch. Note that with the
use of integral images and caching, training each node of
cascade classifiers takes less than 5 minutes. We observe that
most of the training time is spent on bootstrapping difficult
negative samples in later cascade nodes. Our experiments are
performed on a server with a 12-core 2.20 GHz AMD Opteron
CPU and 256 GB RAM. The code is implemented in C++,
with the OpenMP API used for parallelized feature extraction, feature
sorting and bootstrapping.



[Fig. 7 panels: 2-nd node, 4-th node, 8-th node, 12-th node, 16-th node, 20-th node.]

Fig. 7. Weak classifiers' coefficients at each node. The x-axis represents the
index of all 200 candidate weak classifiers.



C. Challenging Face Detection Data Sets 

In this experiment, we evaluate our algorithm on a more
challenging face detection benchmark, the FDDB database [38].
The data sets consist of faces under varying conditions in
unconstrained environments. The database consists of 5,171
faces in 2,845 images. Similar to the experiment on
single-node detectors, we use HOG features to train a face detector.
A cascade of 20 nodes (545 weak classifiers) is trained. The
number of weak classifiers in each node is 4, 7, 10, 10, 12, 12,
15, 15, 20, 20, 30, 30, 30, 40, 40, 40, 50, 50, 50, 60. To ensure
a fair comparison, we use the same cascade structure and
the same number of weak classifiers for all learning methods.
During evaluation, we set the scale factor to 1.25 and the stride
step to 4 pixels.
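
The evaluation settings above determine how many candidate windows the cascade has to classify per image. The Python sketch below counts them for an image pyramid; it is illustrative only, and the 48-pixel window size, the truncation of rescaled dimensions to integers and the function name are assumptions made for the example.

```python
def count_windows(img_w, img_h, win=48, scale=1.25, stride=4):
    """Count sliding windows over an image pyramid.

    The image is shrunk by `scale` at each level until it is smaller than
    the (assumed square) detection window. Returns (levels, total windows).
    """
    total = 0
    levels = 0
    w, h = float(img_w), float(img_h)
    while w >= win and h >= win:
        nx = (int(w) - win) // stride + 1  # window positions per row
        ny = (int(h) - win) // stride + 1  # window positions per column
        total += nx * ny
        levels += 1
        w /= scale
        h /= scale
    return levels, total


levels, total = count_windows(320, 240)
print(levels, total)
```

Since almost all of these windows are background, the cost per image is governed by how early the cascade rejects them, which is why the average number of features evaluated per patch is reported alongside detection performance.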

Since our detector is based on the cascade classifier
framework, we evaluate our algorithms only with the discrete score
approach [38]. In this scheme, each detection is evaluated either
as a match or non-match based on the ratio of overlapping
areas. To be considered a correct detection, the area of overlap
between the predicted bounding box and the ground truth
bounding box must exceed 50%, using the PASCAL object
detection criterion. Post-processing similar to [1] is applied to
merge final detection results. We report our results based on
the average of 10 split runs (the FDDB database provides 10
groups of face data). At each run, we generate the positive
training data by combining faces from 9 groups and use the
face data in the remaining group for testing. We also mirror the
positive training data. In total, there are approximately 8,000
training faces. For our pruning approach, we set $T_1$ to 200.
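
The PASCAL matching rule used above can be written as a short Python sketch. The overlap is measured as the intersection area divided by the union area of the two boxes; the (x, y, width, height) box layout and the function names are assumptions made for this example.

```python
def overlap_ratio(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = min(ax + aw, bx + bw) - max(ax, bx)  # width of the intersection
    ih = min(ay + ah, by + bh) - max(ay, by)  # height of the intersection
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def is_match(detection, ground_truth, threshold=0.5):
    """A detection is counted as correct when the overlap exceeds 50%."""
    return overlap_ratio(detection, ground_truth) > threshold
```

For instance, two 10 × 10 boxes shifted by 5 pixels horizontally overlap by one third and do not match, whereas a shift of 2 pixels gives an overlap of two thirds and counts as a correct detection.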






[Fig. 8 plots: "Faces In The Wild (FDDB) data sets", panels (a) and (b).]

Fig. 8. Cascade performance on FDDB data sets. (a) The curve is generated
by adding one cascade level at a time. (b) The curve is generated by using
the FDDB evaluation code [38]. The results are the average of 10 splits.



Fig. 8(a) shows ROC curves of our approach, AdaBoost and
AdaBoost+LDA. In this experiment, we remove faces which
have a resolution smaller than 48 × 48 pixels from our test sets.
The curve is generated by adding one cascade level at a time.
For example, the rightmost marker corresponds to the
detection rate and the number of false positives using 20 levels of
cascade classifiers. Similar to the previous experiment, pruning
performs better than the other approaches. We also plot the value
of weak classifiers' coefficients of our approach at the 2-nd,
4-th, 8-th, 12-th, 16-th and 20-th node in Fig. 7. The x-axis
represents the index of all 200 candidate weak classifiers and
the y-axis represents weights of weak classifiers. We observe
that our approach selects similar weak classifiers as AdaBoost
in early stages of the cascade (a large number of non-zero
coefficients appear close to the first few indices in the second
and fourth node). However, in later stages, the set of selected
weak classifiers tends to spread out across the 200 candidate weak
classifiers. We also observe that weak learners' coefficients in
later stages are more uniformly distributed. We suspect that
a large number of negative training samples in later stages
consist of difficult-to-classify background patches, i.e., the
bootstrapping process has discarded a lot of easy-to-classify
negative samples. As a result, weak learners' coefficients are
more uniformly distributed as no single weak classifier can



Fig. 8(b) compares ROC curves of various face detectors.
We use the FDDB evaluation software and their original
ground-truth data to generate performance curves. The
evaluation program requires the coordinates, width and
height of each detection result and the confidence score associated
with the detection window. In our experiment, we merge multiple detection
windows into a single detection window and calculate their
average detection window positions. The confidence score is
calculated from the average detection responses in the last
node, i.e., $\sum_{j=1}^{T} \alpha_j h_j(\cdot)$, where $\{h_j(\cdot)\}_{j=1}^{T}$ and $\{\alpha_j\}_{j=1}^{T}$ are
the set of weak classifiers and their coefficients in the last
node. As baselines, we use the face detector of Li et al. [40],
the online approach using a Gaussian process regression scheme
(VJGPR) [41], the OpenCV implementation of Viola and
Jones' face detector [1], the probabilistic part-based detector
[42] and the bounding box estimation based face detector
[43]. From Fig. 8(b), our detector achieves a comparable
performance to the SURF face detector of Li et al. [40].
However, our approach has a better detection rate when the
number of false positives is greater than 200. At a small
number of false positives, we observe that our system performs
slightly worse than [40]. We suspect that this is related to how
the confidence score is generated. [40] defines the confidence
score as the sum of the detection probability of all nodes while
our approach uses the confidence score from the last node.
We suspect that our performance at the low false positive
rate can be further improved by exploiting weak classifiers
learned in previous nodes (similar to the soft cascade [5] and
the multi-exit classifier [13]). In addition, training another set
of coefficients, which maximizes the area under the ROC curve
(AUC), may be a better alternative for generating
confidence scores. We leave this as future work.

Fig. 9 shows an average independent node comparison on
FDDB test sets of different detectors. We observe that pruning
has a smaller false negative rate than AdaBoost and
AdaBoost+LDA in every node. We observe that the improvement
of our approach over AdaBoost+LDA at each node is more
consistent in Fig. 9 than in Fig. 6. We suspect that this may be
due to: (a) the use of 10 split runs, and (b) the FDDB database being much
more challenging than the MIT-CMU test set, so that there
is more room for performance improvement. Again, our results
provide evidence that pruning can further enhance the
final performance. Table III shows the approximate cascade
training time and the average number of features evaluated
per patch. Note that the training time of HOG features is
shorter than the training time of Haar-like features due to the
use of OpenMP for parallelized feature extraction and weak
classifier training. It is important to note here the difference
between Fig. 3 and Table III. Fig. 3 shows the detection
performance of a single-node classifier while Table III shows
the average number of features evaluated per image patch
during testing for the cascade classifier of 20 nodes. The
threshold in Fig. 3 was chosen such that half of the non-faces in
the test set are correctly classified and the other half are
incorrectly classified (a false positive rate of 50% on the
test set). From that experiment, we observe that our approach
has a slightly higher detection rate than AdaBoost+LDA when
[Fig. 9 plot: "Node Performance on FDDB" (legend: AdaBoost, AdaBoost + LDA, Pruning).]

Fig. 9. Average independent node performance on FDDB face detection test sets. The results are the average of 10 splits.

Fig. 10. Detection examples on challenging FDDB test sets. Despite a large
amount of variation in poses and self occlusion, our system is still able to
perform exceptionally well.

Fig. 11. Detection examples on Trellis data sets [39]. The video contains
501 frames of a person moving underneath a trellis with large illumination
and pose changes. On this video, our algorithm runs at 5 frames per second
using HOG features.



both algorithms have the same number of weak classifiers. On
the other hand, the threshold in Table III was chosen such that
half of the non-faces in the training set of that node are passed to
the next node classifier and the other half are discarded (a
false positive rate of 50% on the training set of the given node).
From Table III, we observe that each node of our approach
removes fewer background patches than AdaBoost+LDA, i.e.,
it has a higher average number of features evaluated per patch. In
contrast, both AdaBoost and AdaBoost+LDA have a very
similar average number of features evaluated per image patch.
Our conjecture is that AdaBoost+LDA uses the same set of
features as AdaBoost and therefore has a similar true negative
rate to AdaBoost. In contrast, our approach selects a set of
features based on the asymmetric node learning criterion.



TABLE IV

CPU TIME PERFORMANCE AND DETECTION RATES OF DIFFERENT
DETECTORS ON THE "TRELLIS" VIDEO SEQUENCE (320 × 240 PIXELS
RESOLUTION). THE SAME NUMBER OF WEAK CLASSIFIERS AND NODES
ARE USED IN ALL DETECTORS.

Algorithm            FPS    Det. rate    # false pos. / image
Pruning              5.0    26.0%        0.0016
AdaBoost [1]         5.2    16.3%        0.0002
AdaBoost+LDA [7]     5.0    22.2%        0.0014



Fig. 12. A sample of UIUC car training images. 



This often leads to a better generalization performance than a
cascade of AdaBoost+LDA but it can lead to a slightly higher
number of evaluated features during test time. Fig. 10 shows
some detection examples on challenging FDDB data sets.
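
The link between the per-node pass rate and the average number of features evaluated per patch can be illustrated with an idealised calculation. The Python sketch below assumes that every patch is a background patch and that each node passes exactly half of the patches it receives; the node sizes are those of the 20-node FDDB cascade described earlier. It is a back-of-the-envelope estimate under these assumptions, not a reproduction of the measured values in Table III.

```python
def expected_features_per_patch(node_sizes, pass_rate=0.5):
    """Expected number of weak classifiers evaluated per background patch.

    Idealised model: a patch reaches node k with probability
    pass_rate ** k and then evaluates every weak classifier of that node.
    """
    expected = 0.0
    reach_prob = 1.0
    for size in node_sizes:
        expected += reach_prob * size
        reach_prob *= pass_rate
    return expected


# Node sizes of the 20-node HOG cascade described in the FDDB experiment.
fddb_nodes = [4, 7, 10, 10, 12, 12, 15, 15, 20, 20,
              30, 30, 30, 40, 40, 40, 50, 50, 50, 60]
print(round(expected_features_per_patch(fddb_nodes), 2))  # about 12.9
```

In this model, a cascade whose nodes pass more than half of the test patches evaluates more features on average, which is the mechanism behind the slightly higher evaluation counts discussed above.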

In the next experiment, we use a test video, "Trellis",
obtained from [39] to test the real-time performance of different
detectors. The video contains 501 frames of 320 × 240 pixel
images. It contains a person moving underneath a trellis with
large illumination changes and pose variations. We evaluate all
10 detectors obtained from the previous experiment. During
evaluation, the video is scanned with a stride of 4 pixels in
both directions. We apply a scale ratio of 1.25 for
multi-scale detection. We evaluate our detection using the PASCAL
criterion, i.e., the overlapping ratio between the detection and
the ground truth must exceed 50% to be considered a
correct detection. We record the number of false positives,
true detections and evaluation time obtained from 20 levels of
cascade classifiers. Table IV shows our results by averaging
the number of false positives, detection rate and evaluation
time. We test our detectors using a standard desktop computer
(Intel Core i7 930 CPU with 12 GB memory). Fig. 11 shows
some detection examples of our approach.
[Fig. 13 plot: "Detection performance on EPFL car data sets" (x-axis: number of false positives).]

Fig. 13. Performance on 512 images of side view cars from EPFL data sets
[44].

TABLE V

PERFORMANCE ON UIUC MULTI-SCALE CAR TEST SETS

Algorithm              F-Measure    Det. rate    # false pos.
Pruning                98.6%        97.8%        1
AdaBoost [1]           98.6%        98.6%        2
AdaBoost+LDA [7]       98.6%        97.8%        1
CS-AdaBoost [45]       95.26%       95.5%        9
Agarwal et al. [46]    43.4%        38.9%        56

Fig. 14. Detection results from the car show environment using AdaBoost
(first column), AdaBoost+LDA (middle column) and our pruning approach
(last column). Test data sets contain cars of type sports sedans, hatchbacks and
station wagons. These car models are different from those used during training
(mostly typical sedans) (Fig. 12). Despite these differences, our detectors
detect most cars correctly.

TABLE VI

AVERAGE NUMBER OF HOG FEATURES EVALUATED PER PATCH.

Algorithm            UIUC test sets    EPFL test sets
Pruning              22.6              20.3
AdaBoost [1]         24.6              23.1
AdaBoost+LDA [7]     24.6              23.1


D. Car Detection 

Next, we conduct an experiment on car detection. We
compare different detectors on the UIUC data sets [46]. Training
sets consist of 550 mixed left and right profile views of cars
with a resolution of 40 × 100 pixels. Fig. 12 shows some random
samples of UIUC car training images. We combine both left
and right profile views with their mirrored samples and train a
single detector. Our training sets consist of 1,100 car samples.
Negative training images used are the same as those used in
the face experiments. The detector is evaluated on 108 multi-scale
test images consisting of 139 cars.

We compare our pruning approach to AdaBoost and
AdaBoost+LDA. We evaluate our algorithm on HOG features.
We define blocks with the following scales (minimum of 4 × 4
pixels and maximum of 40 × 100 pixels) and width-length
ratios of 1:1, 1:2, 2:1, 1:3 and 3:1. Each block
is divided into 2 × 2 cells, and the HOG in each cell is divided
into 9 bins. There are a total of 3,801 blocks. An $\ell_1$-sqrt
normalization is applied to the feature vector [33]. At each
iteration, we randomly sample 25% of all possible blocks for
training a weak classifier. We train a visual detector of 20
cascade nodes. The number of weak classifiers in each node
is 4, 7, 10, 10, 12, 12, 15, 15, 20, 20, 30, 30, 30, 40, 40,
40, 50, 50, 50, 60. For our pruning approach, we set $T_1$ to
300. We follow the technique used in [1] to merge multiple
detection windows.

To evaluate our performance, we use the software provided 
along with the data set. An output is counted as a correct 
detection if it lies within 25% of the true object dimension 
in each direction. Only one detection window is counted as 
correct if two or more detection windows satisfy the criteria. 
We record the performance by F-measure, detection rate and 
the number of false detections. The F-measure is the weighted 
harmonic mean of precision and recall. Table |V| shows the 
performance of different detectors. We provide a baseline 
system of |46| and |45|. Note that the method of ||46j| and 
(45] were trained with only 500 negative patches so it can 
not be directly compared with our algorithms. Also, | [45| 
uses a single-scale test set. From the table, our evaluated 
algorithms achieve similar F-Measure and perform similarly 
on UIUC multi-scale test sets. In the next experiment, we 
test our detectors on more challenging EPFL car data sets 
Ozuysal2009Pose. We use 512 images of left and right profile 
poses from 20 car models. We evaluate our detection results 
against provided ground truths. We use PASCAL criteria 
for this experiment. Fig. [13] compares ROC curves of dif- 



ferent approaches. Our results clearly indicate that pruning 
can further improve the generalization performance of visual 
detectors. Fig. |IV-D| demonstrates some car detection results 
using different algorithms. It is quite interesting to observe that 
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our detector is able to detect car in the penultimate row but 
not in the last row. We suspect our detector fails to detect a car 
whose boot has a vertical shape. This is not surprising since 
this particular shape is not common in our training samples. 
We compare the average number of HOG features used in 
Table VI. For car data sets, the proposed pruning not only 
gives a higher detection rate but also runs faster than 
the other evaluated algorithms (as it requires fewer HOG 
features on average). 
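
The evaluation measures used above can be made concrete with a short sketch. It is a minimal illustration, not the evaluation software shipped with the UIUC data set: the box format `(x, y, width, height)`, the function names, and the per-axis reading of the 25% tolerance are assumptions introduced here, and the F-measure is shown with equal weights (`beta = 1`) by default. The `iou` helper reflects the usual PASCAL overlap test, under which a detection is accepted when its intersection over union with the ground truth exceeds 0.5.

```python
from typing import List, Tuple

# Assumed box format (hypothetical): top-left corner plus size.
Box = Tuple[float, float, float, float]  # (x, y, width, height)


def is_match(det: Box, truth: Box, tol: float = 0.25) -> bool:
    """Assumed reading of the 25% rule: the detection's corner lies within
    tol * object width horizontally and tol * object height vertically."""
    dx = abs(det[0] - truth[0])
    dy = abs(det[1] - truth[1])
    return dx <= tol * truth[2] and dy <= tol * truth[3]


def score(dets: List[Box], truths: List[Box], tol: float = 0.25) -> Tuple[int, int]:
    """Return (correct, false) detection counts.
    Each true object is credited at most once; extra hits count as false."""
    matched = [False] * len(truths)
    correct = 0
    for det in dets:
        for i, truth in enumerate(truths):
            if not matched[i] and is_match(det, truth, tol):
                matched[i] = True
                correct += 1
                break
    return correct, len(dets) - correct


def f_measure(correct: int, false: int, n_truth: int, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (beta = 1 weights both equally)."""
    precision = correct / (correct + false) if correct + false else 0.0
    recall = correct / n_truth if n_truth else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes, as used by the PASCAL overlap test."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


# One true car, two overlapping hits on it, and one stray window.
truths = [(100.0, 50.0, 100.0, 40.0)]
dets = [(110.0, 55.0, 100.0, 40.0), (112.0, 52.0, 100.0, 40.0), (300.0, 50.0, 100.0, 40.0)]
correct, false = score(dets, truths)
f1 = f_measure(correct, false, len(truths))
```

In this toy case only one of the two overlapping windows is credited, so the second window and the stray window both count as false detections. This mirrors the rule that a single object yields at most one correct detection.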

E. Discussion 

In this section, we briefly discuss our experimental results 
reported earlier and two important assumptions we made in 
our experiments. In the previous section, we evaluated our 
detector on two challenging object types, namely faces in 
unconstrained environments and unconstrained vehicle types. 
We observe that our approach outperforms other algorithms on 
both face and car data sets (Figs. 6, 9 and 13 and Table V). In 
terms of evaluation time, we observe that our approach is 
slightly faster than both AdaBoost and AdaBoost+LDA on car data 
sets (Table VI) but slightly slower than both AdaBoost and 
AdaBoost+LDA on face data sets (Table III). As previously 
discussed, both AdaBoost and AdaBoost+LDA have a similar 
average number of features evaluated per image patch. In 
contrast, our approach selects a set of features based on the 
asymmetric node learning criterion. These selected features 
often lead to better generalization performance than 
those used in AdaBoost+LDA. However, they can result in 
either slightly fewer features being extracted on average 
during testing (for car data sets) or slightly more 
features per image patch on average during testing (for 
face data sets). 

It is important to note here that the success of our approach 
is based on two key assumptions. The first assumption is that 
the set of weak learners selected must be normally distributed. 
For object detection problems, Wu et al. demonstrate that 
this assumption is often valid in practice, i.e., $\mathbf{w}^{\top}\mathbf{h}$ is 
approximately Gaussian for most instantiations of $\mathbf{h}$ [7]. Here 
$\mathbf{h} = [h_1(\cdot), h_2(\cdot), \ldots, h_t(\cdot)]$. Setting Ti to be large in our 
experiments ensures that the initial set of features is normally 
distributed, since the mean of a sufficiently large number of 
independent random variables is approximately normally 
distributed (the central limit theorem). The second assumption 
is that the covariance matrix, $\mathbf{S}_w$, is non-singular. In other 
words, the approach fails when the covariance matrix does 
not have full rank and its inverse does not exist. In order 
to avoid this ill-posed problem, we can regularize the matrix 
$\mathbf{S}_w$ by $\tilde{\mathbf{S}}_w = \mathbf{S}_w + \lambda \mathbf{I}$, where $\mathbf{I}$ is the identity matrix and $\lambda$ is 
the regularization parameter. This technique is also known as 
Tikhonov regularization. 
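
The effect of this regularization can be seen in a small sketch. It assumes, purely for illustration, two weak learners whose outputs are identical on every sample, so that their sample covariance matrix is rank deficient. The helper names (`covariance`, `regularize`, `det2`) and the value of `lam` are hypothetical and are not taken from our experiments.

```python
from typing import List

Matrix = List[List[float]]


def covariance(rows: List[List[float]]) -> Matrix:
    """Sample covariance (divides by n - 1); one row per sample."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]


def regularize(S: Matrix, lam: float) -> Matrix:
    """Tikhonov regularization: return S + lam * I."""
    d = len(S)
    return [[S[i][j] + (lam if i == j else 0.0) for j in range(d)]
            for i in range(d)]


def det2(S: Matrix) -> float:
    """Determinant of a 2 x 2 matrix."""
    return S[0][0] * S[1][1] - S[0][1] * S[1][0]


# Assumed toy data: two weak learners that always agree (redundant features).
outputs = [[1.0, 1.0], [-1.0, -1.0], [1.0, 1.0], [-1.0, -1.0]]
S_w = covariance(outputs)            # every entry equals 4/3, so S_w is singular
S_reg = regularize(S_w, lam=1e-3)    # lam is an illustrative value only

singular_before = det2(S_w) == 0.0
invertible_after = det2(S_reg) > 0.0
```

Adding a small positive value to the diagonal shifts every eigenvalue of the covariance matrix up by that value, so the regularized matrix is invertible even when the weak learners are perfectly redundant.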

V. Conclusion 

We have presented a two-stage approach for training a visual 
object detector by separating the feature extraction process 
from the construction of the asymmetric classifier. The learned 
ensemble better meets the asymmetric node learning objective and 
consequently improves the detection performance of the entire 
cascade. Experiments on various data sets show that the proposed 
method consistently outperforms the traditional framework of 
Viola-Jones [1] as well as the LDA post-processing algorithm 
of Wu et al. [7]. We have also demonstrated empirically that 
training the HOG cascade classifier using our approach is 
competitive with state-of-the-art methods in the literature. 
On FDDB data sets, our approach substantially outperforms 
all other methods evaluated and achieves state-of-the-art 
performance. In the future, we plan to investigate asymmetric 
learning objectives in multi-class problems. Another direction 
of future research would be to investigate whether parameter 
$\gamma$ in (12) should be adjusted in each cascade node. 